Introduction to scRNAseq and data pre-processing

link to powerpoint slides: A power point presentation will describe the general principles of single-cell ‘omics’ and basics about the platforms (~15 - 20 minutes)


1. FASTQ Files, genome alignments, and deconvolution of cells

This workshop provides an overview of key quality control (QC) metrics and processing steps in the initial steps of single-cell transcriptomic data analysis. While detailed tutorials for processing FASTQ files, performing genome alignments, and deconvoluting cells are available through various resources, we’ll focus on understanding the fundamental QC metrics and data structures.

Links to alignment preliminary analysis tutorials:

  • Vendor-specific pipeline resources (e.g., 10X Genomics)

  • Galaxy

  • Google colab hands-on experience (under construction)

1.1. Understanding the ‘knee plot’

Upon receiving aligned single-cell/nuclei sequencing data, one of the initial plots to examine is the ‘knee plot,’ commonly generated during analysis pipelines. This plot helps distinguish between barcodes representing cells and those representing background noise or empty Gel Emulsion beads (GEMs).

Gel Emulsion beads (GEMs) are microfluidic droplets used in platforms like 10X Genomics, each potentially containing a cell or nuclei. Distinguishing between GEMs with cells and empty GEMs is crucial for accurate analysis.

The knee plot visually depicts how a threshold is set to differentiate between cell barcodes and background noise. In platforms like 10X Genomics, cells within GEMs produce a larger number of unique mRNA molecules (UMIs) compared to empty GEMs. The knee plot typically shows a sudden drop (knee) in UMIs, indicating the separation between GEMs containing cells and those lacking them.

Figure from Danielski, K. (2023). Guidance on Processing the 10x Genomics Single Cell Gene Expression Assay. In: Calogero, R.A., Benes, V. (eds) Single Cell Transcriptomics. Methods in Molecular Biology, vol 2584. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2756-3_1

Interpreting the knee plot involves identifying this knee point. A distinct knee suggests successful separation, while the absence of a clear knee could indicate issues such as low RNA integrity or high ambient RNA contamination.

Figure from https://isoseq.how/umi/cell-calling.html


2. Setting up our R environment for data exploration

In preparation for the following hands-on experiences we will set up an R environment which has everything we need.

2.1. Load libraries

suppressPackageStartupMessages({
  library(Seurat)
  library(ggplot2)
  library(tidyr)
  library(scDblFinder)
  library(SingleCellExperiment)
})

2.2. Import data

For this tutorial we will use raw unprocessed data from human peripheral blood mononuclear cells (PBMCs) generated using the 10X Genomics platform which can be download from the Seurat page. These data should be downloaded, extracted, and placed in the downloads folder for this workshop.

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "./downloads/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
seurat_object <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", 
                                    min.cells = 3, min.features = 200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

Note: We apply some preliminary filters as we know that GEMs which very low numbers of genes or genes which are barely detected are unlikely to be actual cells.

2.3. Check that everything was loaded correctly

print(seurat_object)
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
 1 layer present: counts

3. Overview of Single-cell Data Structure

To better understand how to work with single-cell/nuclei RNA sequencing data we will take a deep dive into how datasets are typically structured. In the diagram below we illustrate how droplet-based scRNAseq approaches separate cells and their mRNAs. But how does this translate into data?

Generated using biorender.com
Generated using biorender.com

3.1. The Gene x Cell Matrix

# Raw counts in a seurat object is saved in the assays-RNA-counts slot 
raw_counts = as.matrix(seurat_object@assays$RNA@layers$counts)
rownames(raw_counts) = rownames(seurat_object)
colnames(raw_counts) = colnames(seurat_object)

# Lets look at the first 4 columns and rows
print(raw_counts[1:4,1:4])
              AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
AL627309.1                   0                0                0                0
AP006222.2                   0                0                0                0
RP11-206L10.2                0                0                0                0
RP11-206L10.9                0                0                0                0

What does this mean?

Columns: The columns represent a unique barcode. As we can see below, this barcode marks an individual cell which is how we can identify each one. So in this object, the columns are the cells

Generated in biorender.com and adapted from Danielski, K. (2023). Guidance on Processing the 10x Genomics Single Cell Gene Expression Assay. In: Calogero, R.A., Benes, V. (eds) Single Cell Transcriptomics. Methods in Molecular Biology, vol 2584. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2756-3_1
Generated in biorender.com and adapted from Danielski, K. (2023). Guidance on Processing the 10x Genomics Single Cell Gene Expression Assay. In: Calogero, R.A., Benes, V. (eds) Single Cell Transcriptomics. Methods in Molecular Biology, vol 2584. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2756-3_1

Rows: We see that the first few rows have names such as RP11-206L10.9. A quick google search will tell you that this is a human gene (Ensembl ID - ENSG00000237491). So we see in this object that the rows represent the genes.

3.2. Counting the number of cells and genes

Single-cell transcriptomic analysis tools (e.g., Seurat) does a lot of the heavy lifting for us when it comes to counts and other summary data information. For example, the following code will tell you the number of genes and cells, as well store raw, normalized, and scaled data, dimension reductions, and other analyses which may have been performed.

print(seurat_object)
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
 1 layer present: counts

This indicates that we have 13,714 genes (features) across 2,700 samples.

However, we can also extract that information from the count table:

print(paste0('Cells: ', ncol(raw_counts)))
[1] "Cells: 2700"
print(paste0('Features:', nrow(raw_counts)))
[1] "Features:13714"

These numbers should match!

3.3. A deeper dive into counts

3.3a. What is sequencing depth?

# Calculate the total reads in each individual cell
read_depths = colSums(raw_counts)

# How are they distributed?
hist(read_depths, main = "Histogram of read depth", 
     xlab = "Depth", ylab = "Frequency",
     breaks = 200)

This histogram shows that the number of mRNA transcripts detected in each cell ranged from 546 to 15,818, but that most were around 2,000. So what might account for these differences?

  • Cell types: Some cell types may simply just express more mRNA than others. For example, Hepatocytes are known to contain a lot of RNA while others may not.

  • Stochastic gene expression: Fluctuations in transcription and degradation from normal biological processes, cell states, and other factors in individual cells which have very low amounts of RNA to begin can appear as big differences.

  • Technical variability: Sequencing depth, assay kits, and other technical factors may contribute to these differences.

The zero-inflation question: In the early days of scRNAseq it was a widely accepted that data was zero-inflated, i.e., it had more zero’s than it should because of technical challenges. Today, we better understand that the large number of zero’s are not only technical, but in fact does reflect true biology as well. Here are some interesting papers on the topic: Jiang et al., 2022


4. Assessing data quality

Any analysis, no matter of size, should undergo rigorous QC. When it comes to single-cell transcriptomic data, there are several computational tools that help you in the process from enabling visualizations to performing a panel of QC tests and reporting several different metrics for the end user to evaluate. A list of just some of these tools can be found here.

https://www.scrna-tools.org/
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html

Let’s go through some of the most common QC metrics, what to they mean, and why do they matter?

4.1. Genes and transcripts

How does the number of detected genes and transcripts indicate quality? While we don’t have a known number of expressed genes for each cell type, we do know that cells require the expression of hundreds to thousands of genes to function properly.

Number of genes (features)

We can check these values by running the code below:

VlnPlot(seurat_object, features = 'nFeature_RNA') + NoLegend()
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

Here we can see that (1) the number of genes detected ranges from ~200 - 3000 and (2) most express ~900 genes. Is this normal and expected? In reality, there is no universal threshold, these values will depend on the cell type and the sequencing depth. But we do know that if most cells had extemely low counts that either cells were not healthy/intact or sequencing depth was much too low. It could also point to errors further upstream such as alignment to the incorrect reference genome.

Number of mRNA molecules (count)

We can also examine the total number of mRNA molecules detected:

VlnPlot(seurat_object, features = 'nCount_RNA')
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

As above, we can see a good distribution centered around 2500 mRNA molecules. These value should always be higher than the number of features.

Correlation of features and molecules

A third way to look at this data is to examine the correlation of features and mRNA molecules:

FeatureScatter(object = seurat_object, feature1 = 'nCount_RNA', feature2 = 'nFeature_RNA')

This metric is more informative because it tells you that as you detect more mRNA molecules, you also detect more different genes. This would generally be expected as you are getting a larger sampling of the mRNA resulting in an increased chance of sampling different genes. However, when you assess these plots, you will want to keep the biology in mind.

4.2. Mitochondrial RNA

Here, biological knowledge is extremely important for evaluating this metric and removing cells based on a threshold. Consider these three experiments:

  • Nuclei isolated from muscle cells which require a lot of energy

  • Muscle whole cells which require a lot of energy

  • Skin cells which are less metabolically active

What would you expect to see (qualitative) in the amount of mitochondrial genes for each of these experiments?

Assessing percent mitochondria counts

We can assess the percentage of mitochondrial genes by calculating the number of counts coming from mitochondrial genes (usually prefixed with mt- or MT-) compared to the total counts. This can be visualized like this:

# Calcuate the percentage of mitochondrial counts
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")

# Plot the values
VlnPlot(seurat_object, features = 'percent.mt')
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.

We see that the percentage of counts representing mitochondrial genes is mostly below 5% but as high as ~25%. To evaluate these results think back to the three experiments and consider what you would expect to see:

  • Nuclei isolated from muscle cells which require a lot of energy

    Despite being metabolically active, nuclei isolation is expected to wash out mitochondria. High contamination is indicative of high ambient RNA or mitochondria sticking to the the nuclei.

  • Muscle whole cells which require a lot of energy

    Given that muscle cells are highly metabolically active and are known to have among the most mitochondria among mammalian cells, these data may show higher percentage of mitochondrial genes than most datasets.

  • Skin cells which are less metabolically active

    Given that these are less metabolically active, they may be expected to show similar or lower percentage of mitochondrial genes.

Correlation of counts and mitochondrial counts

As with the features and counts, we can look at how these correlate:

FeatureScatter(seurat_object, feature1 = 'nCount_RNA', feature2 = 'percent.mt')

Interestingly, these are not correlated at all. Prior to filtering, these values will be anti-correlated as mitochondria often makes up the ambient RNA (release from dead or lysed cells) and will be more represented in the background (low count cells).

4.3. Ambient RNA contamination

As with any single-cell transcriptomic tool, there are dozens, if not more, options for removal of abient RNA. Generally, the concept for ambient RNA removal is the same - substract genes which are present in the empty GEMs/wells from those which are found to have cells. Other techniques such as combinatorial barcoding approaches have to make different assumptions.

SoupX is one example of a tool which cleans up the soup (floating ambient RNA) from cells. This tools uses the most raw format of data which is generate before a threshold is set on the knee plot discussed above. For 10X Genomics these are stored in the raw_feature_bc_matrix and filtered_feature_bc_matrix folders.

This tutorial will not go through the entire ambient RNA removal workflow. Excellent examples can be found here and here.

4.4 Doublets

Another common challenge with using single-cell transcriptomic platforms is one that can affect most single-cell based techniques including flow cytometry - doublets. These can be a result of cells sticking to each other or being too close to each other during the sorting, and the rate is typically a function of the total number of cells examined (more cells results in more doublets).

Doublets can consist of either 2 cells of the same type (homotypic) or different cell types (heterotypic). As you can probably imagine, distinguising homotypic doublets is more challenging, but are generally considered to be less of a concern.

DoubletFinder and scDblFinder are R based tool for the identification of doublets based on the general principle outlined above.

sce = as.SingleCellExperiment(seurat_object)
Warning: Layer ‘data’ is emptyWarning: Layer ‘scale.data’ is empty
sce <- scDblFinder(sce)
Creating ~5000 artificial doublets...
Dimensional reduction
Evaluating kNN...
Training model...
iter=0, 271 cells excluded from training.
iter=1, 227 cells excluded from training.
iter=2, 209 cells excluded from training.
Threshold found:0.518
109 (4%) doublets called
# Transfer the doublet information to the seurat object
seurat_object$scDblFinder.class = colData(sce)$scDblFinder.class
seurat_object$scDblFinder.score = colData(sce)$scDblFinder.score

4.5. Filtering data

seurat_object <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 & scDblFinder.class == 'singlet')

print(seurat_object)
An object of class Seurat 
26286 features across 2529 samples within 2 assays 
Active assay: SCT (12572 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 2 dimensional reductions calculated: pca, umap

5. Normalization

seurat_object = SCTransform(seurat_object, verbose = F)
seurat_object = RunPCA(seurat_object, verbose = F)
seurat_object = RunUMAP(seurat_object, dims = 1:30, verbose = F)
DimPlot(seurat_object, group.by='orig.ident')

6. Dimension reduction and clustering; 7. Annotation of cell types; 8. Trajectories and RNA velocity; 9. Inferring cell-cell communication; 10. Spatial transcriptomics
---
title: "Introduction to scRNAseq"
output: html_notebook
---

# Introduction to scRNAseq and data pre-processing

*link to powerpoint slides:* A power point presentation will describe the general principles of single-cell 'omics' and basics about the platforms (\~15 - 20 minutes)

------------------------------------------------------------------------

## 1. FASTQ Files, genome alignments, and deconvolution of cells

This workshop provides an overview of key quality control (QC) metrics and processing steps in the initial steps of single-cell transcriptomic data analysis. While detailed tutorials for processing FASTQ files, performing genome alignments, and deconvoluting cells are available through various resources, we'll focus on understanding the fundamental QC metrics and data structures.

***Links to alignment preliminary analysis tutorials:***

-   Vendor-specific pipeline resources (e.g., [10X Genomics](https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/cr-tutorial-in))

-   [Galaxy](https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-preprocessing/tutorial.html)

-   Google colab hands-on experience ([*under constructio*n]{.underline})

### 1.1. Understanding the 'knee plot'

Upon receiving aligned single-cell/nuclei sequencing data, one of the initial plots to examine is the 'knee plot,' commonly generated during analysis pipelines. This plot helps distinguish between barcodes representing cells and those representing background noise or empty Gel Emulsion beads (GEMs).

*Gel Emulsion beads (GEMs)* are microfluidic droplets used in platforms like 10X Genomics, each potentially containing a cell or nuclei. Distinguishing between GEMs with cells and empty GEMs is crucial for accurate analysis.

The knee plot visually depicts how a threshold is set to differentiate between cell barcodes and background noise. In platforms like 10X Genomics, cells within GEMs produce a larger number of unique mRNA molecules (UMIs) compared to empty GEMs. The knee plot typically shows a sudden drop (knee) in UMIs, indicating the separation between GEMs containing cells and those lacking them.

[![Figure from Danielski, K. (2023). Guidance on Processing the 10x Genomics Single Cell Gene Expression Assay. In: Calogero, R.A., Benes, V. (eds) Single Cell Transcriptomics. Methods in Molecular Biology, vol 2584. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2756-3_1](images/clipboard-4280699880.png){style="border:1px solid #000000; padding:3px; margin:2px" width="642"}](https://link.springer.com/protocol/10.1007/978-1-0716-2756-3_1)

Interpreting the knee plot involves identifying this knee point. A distinct knee suggests successful separation, while the absence of a clear knee could indicate issues such as low RNA integrity or high ambient RNA contamination.

[![Figure from https://isoseq.how/umi/cell-calling.html](images/clipboard-572921448.png){style="border:1px solid #000000; padding:3px; margin:2px" width="676"}](https://isoseq.how/umi/cell-calling.html)

------------------------------------------------------------------------

## 2. Setting up our R environment for data exploration

In preparation for the following hands-on experiences we will set up an R environment which has everything we need.

### **2.1. Load libraries**

```{r}
suppressPackageStartupMessages({
  library(Seurat)
  library(ggplot2)
  library(tidyr)
  library(scDblFinder)
  library(SingleCellExperiment)
})
```

### **2.2. Import data**

For this tutorial we will use raw unprocessed data from human peripheral blood mononuclear cells (PBMCs) generated using the 10X Genomics platform which can be download from the [Seurat page](https://satijalab.org/seurat/articles/pbmc3k_tutorial). These data should be downloaded, extracted, and placed in the downloads folder for this workshop.

```{r}
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "./downloads/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
seurat_object <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", 
                                    min.cells = 3, min.features = 200)
```

*Note: We apply some preliminary filters as we know that GEMs which very low numbers of genes or genes which are barely detected are unlikely to be actual cells.*

### **2.3. Check that everything was loaded correctly**

```{r}
print(seurat_object)
```

## 3. Overview of Single-cell Data Structure

To better understand how to work with single-cell/nuclei RNA sequencing data we will take a deep dive into how datasets are *typically* structured. In the diagram below we illustrate how droplet-based scRNAseq approaches separate cells and their mRNAs. But how does this translate into [***data***]{.underline}?

![Generated using biorender.com](images/clipboard-2793142743.png){style="border:1px solid #000000; padding:3px; margin:2px" width="403"}

### 3.1. The Gene x Cell Matrix

```{r}
# Raw counts in a seurat object is saved in the assays-RNA-counts slot 
raw_counts = as.matrix(seurat_object@assays$RNA@layers$counts)
rownames(raw_counts) = rownames(seurat_object)
colnames(raw_counts) = colnames(seurat_object)

# Lets look at the first 4 columns and rows
print(raw_counts[1:4,1:4])
```

***What does this mean?***

**Columns:** The columns represent a unique `barcode`. As we can see below, this `barcode` marks an individual cell which is how we can identify each one. So in this object, the columns are the cells

![Generated in biorender.com and adapted from Danielski, K. (2023). Guidance on Processing the 10x Genomics Single Cell Gene Expression Assay. In: Calogero, R.A., Benes, V. (eds) Single Cell Transcriptomics. Methods in Molecular Biology, vol 2584. Humana, New York, NY. <https://doi.org/10.1007/978-1-0716-2756-3_1>](images/clipboard-1606285495.png){style="border:1px solid #000000; padding:3px; margin:2px"}

**Rows:** We see that the first few rows have names such as *RP11-206L10.9.* A quick *google* search will tell you that this is a human gene (Ensembl ID - ENSG00000237491). So we see in this object that the rows represent the genes.

### 3.2. Counting the number of cells and genes

Single-cell transcriptomic analysis tools (*e.g.*, Seurat) does a lot of the heavy lifting for us when it comes to counts and other summary data information. For example, the following code will tell you the number of genes and cells, as well store raw, normalized, and scaled data, dimension reductions, and other analyses which may have been performed.

```{r}
print(seurat_object)
```

This indicates that we have 13,714 genes (features) across 2,700 samples.

However, we can also extract that information from the count table:

```{r}
print(paste0('Cells: ', ncol(raw_counts)))
print(paste0('Features:', nrow(raw_counts)))
```

These numbers should match!

### 3.3. A deeper dive into counts

#### 3.3a. What is sequencing depth?

```{r}
# Calculate the total reads in each individual cell
read_depths = colSums(raw_counts)

# How are they distributed?
hist(read_depths, main = "Histogram of read depth", 
     xlab = "Depth", ylab = "Frequency",
     breaks = 200)
```

This histogram shows that the number of mRNA transcripts detected in each cell ranged from 546 to 15,818, but that most were around 2,000. So what might account for these differences?

-   **Cell types:** Some cell types may simply just express more mRNA than others. For example, Hepatocytes are known to contain a lot of RNA while others may not.

-   **Stochastic gene expression:** Fluctuations in transcription and degradation from normal biological processes, cell states, and other factors in individual cells which have very low amounts of RNA to begin can appear as big differences.

-   **Technical variability:** Sequencing depth, assay kits, and other technical factors may contribute to these differences.

[*The zero-inflation question*]{.underline}*: In the early days of scRNAseq it was a widely accepted that data was zero-inflated, i.e., it had more zero's than it should because of technical challenges. Today, we better understand that the large number of zero's are not only technical, but in fact does reflect true biology as well. Here are some interesting papers on the topic: [Jiang et al., 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02601-5)*

------------------------------------------------------------------------

## 4. Assessing data quality

Any analysis, no matter of size, should undergo rigorous QC. When it comes to single-cell transcriptomic data, there are several computational tools that help you in the process from enabling visualizations to performing a panel of QC tests and reporting several different metrics for the end user to evaluate. A list of just some of these tools can be found [here](https://www.scrna-tools.org/tools?sort=name&cats=QualityControl).

```         
https://www.scrna-tools.org/
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
```

Let's go through some of the most common QC metrics, what to they mean, and why do they matter?

### 4.1. Genes and transcripts

*How does the number of detected genes and transcripts indicate quality?* While we don't have a known number of expressed genes for each cell type, we do know that cells require the expression of hundreds to thousands of genes to function properly.

**Number of genes (features)**

We can check these values by running the code below:

```{r}
VlnPlot(seurat_object, features = 'nFeature_RNA') + NoLegend()
```

Here we can see that (1) the number of genes detected ranges from \~200 - 3000 and (2) most express \~900 genes. *Is this normal and expected?* In reality, there is no universal threshold, these values will depend on the cell type and the sequencing depth. But we do know that if most cells had extemely low counts that either cells were not healthy/intact or sequencing depth was much too low. It could also point to errors further upstream such as alignment to the incorrect reference genome.

**Number of mRNA molecules (count)**

We can also examine the total number of mRNA molecules detected:

```{r}
VlnPlot(seurat_object, features = 'nCount_RNA')
```

As above, we can see a good distribution centered around 2500 mRNA molecules. These value should always be higher than the number of features.

**Correlation of features and molecules**

A third way to look at this data is to examine the correlation of features and mRNA molecules:

```{r}
FeatureScatter(object = seurat_object, feature1 = 'nCount_RNA', feature2 = 'nFeature_RNA')
```

This metric is more informative because it tells you that as you detect more mRNA molecules, you also detect more different genes. This would generally be expected as you are getting a larger sampling of the mRNA resulting in an increased chance of sampling different genes. However, when you assess these plots, you will want to [**keep the biology in mind**]{.underline}.

### 4.2. Mitochondrial RNA

Here, biological knowledge is extremely important for evaluating this metric and removing cells based on a threshold. Consider these three experiments:

-   Nuclei isolated from muscle cells which require a lot of energy

-   Muscle whole cells which require a lot of energy

-   Skin cells which are less metabolically active

*What would you expect to see (qualitative) in the amount of mitochondrial genes for each of these experiments?*

**Assessing percent mitochondria counts**

We can assess the percentage of mitochondrial genes by calculating the number of counts coming from mitochondrial genes (usually prefixed with mt- or MT-) compared to the total counts. This can be visualized like this:

```{r}
# Calcuate the percentage of mitochondrial counts
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")

# Plot the values
VlnPlot(seurat_object, features = 'percent.mt')
```

We see that the percentage of counts representing mitochondrial genes is mostly below 5% but as high as \~25%. To evaluate these results think back to the three experiments and consider what you would expect to see:

-   Nuclei isolated from muscle cells which require a lot of energy

    Despite being metabolically active, nuclei isolation is expected to wash out mitochondria. High contamination is indicative of high ambient RNA or mitochondria sticking to the the nuclei.

-   Muscle whole cells which require a lot of energy

    Given that muscle cells are highly metabolically active and are known to have among the most mitochondria among mammalian cells, these data may show higher percentage of mitochondrial genes than most datasets.

-   Skin cells which are less metabolically active

    Given that these are less metabolically active, they may be expected to show similar or lower percentage of mitochondrial genes.

**Correlation of counts and mitochondrial counts**

As with the features and counts, we can look at how these correlate:

```{r}
FeatureScatter(seurat_object, feature1 = 'nCount_RNA', feature2 = 'percent.mt')
```

Interestingly, these are not correlated at all. Prior to filtering, these values will be anti-correlated as mitochondria often makes up the ambient RNA (release from dead or lysed cells) and will be more represented in the background (low count cells).

### 4.3. Ambient RNA contamination

As with any single-cell transcriptomic tool, there are dozens, if not more, options for removal of abient RNA. Generally, the concept for ambient RNA removal is the same - substract genes which are present in the *empty* GEMs/wells from those which are found to have cells. Other techniques such as combinatorial barcoding approaches have to make different assumptions.

![<https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html>](images/clipboard-2328202483.png){width="643"}

[SoupX](https://academic.oup.com/gigascience/article/9/12/giaa151/6049831) is one example of a tool which cleans up the *soup* (floating ambient RNA) from cells. This tools uses the most raw format of data which is generate before a threshold is set on the knee plot discussed above. For 10X Genomics these are stored in the `raw_feature_bc_matrix` and `filtered_feature_bc_matrix` folders.

This tutorial will not go through the entire ambient RNA removal workflow. Excellent examples can be found [here](https://cellgeni.github.io/notebooks/html/new-10kPBMC-SoupX.html) and [here](https://cran.r-project.org/web/packages/SoupX/vignettes/pbmcTutorial.html).

### 4.4 Doublets

Another common challenge with using single-cell transcriptomic platforms is one that can affect most single-cell based techniques including flow cytometry - doublets. These can be a result of cells sticking to each other or being too close to each other during the *sorting*, and the rate is typically a function of the total number of cells examined (more cells results in more doublets).

![<https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html>](images/clipboard-384387352.png){width="1139"}

Doublets can consist of either 2 cells of the same type (homotypic) or different cell types (heterotypic). As you can probably imagine, distinguising homotypic doublets is more challenging, but are generally considered to be less of a concern.

[DoubletFinder](https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30073-0) and [scDblFinder](https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30073-0) are R based tool for the identification of doublets based on the general principle outlined above.

```{r}
sce = as.SingleCellExperiment(seurat_object)
sce <- scDblFinder(sce)
```

```{r}
# Transfer the doublet information to the seurat object
seurat_object$scDblFinder.class = colData(sce)$scDblFinder.class
seurat_object$scDblFinder.score = colData(sce)$scDblFinder.score

```

### 4.5. Filtering data

```{r}
seurat_object <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 & scDblFinder.class == 'singlet')

print(seurat_object)
```

## 5. Normalization

```{r}
seurat_object = SCTransform(seurat_object, verbose = F)
seurat_object = RunPCA(seurat_object, verbose = F)
seurat_object = RunUMAP(seurat_object, dims = 1:30, verbose = F)
```

```{r}
DimPlot(seurat_object, group.by='orig.ident')
```

```         
6. Dimension reduction and clustering; 7. Annotation of cell types; 8. Trajectories and RNA velocity; 9. Inferring cell-cell communication; 10. Spatial transcriptomics
```
